Introduction

Dataset from the U.S. Small Business Administration (SBA). For this case-study assignment, students assume the role of loan officer at a bank and are asked to approve or deny a loan by assessing its risk of default using logistic regression.

Background and Description of Datasets

The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market. Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. One way SBA assists these small business enterprises is through a loan guarantee program which is designed to encourage banks to grant loans to small businesses. SBA acts much like an insurance provider to reduce the risk for a bank by taking on some of the risk through guaranteeing a portion of the loan. In the case that a loan goes into default, SBA then covers the amount they guaranteed.

There have been many success stories of start-ups receiving SBA loan guarantees such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans. The rate of default on these loans has been a source of controversy for decades. Conservative economists believe that credit markets perform efficiently without government participation. Supporters of SBA-guaranteed loans argue that the social benefits of job creation by those small businesses receiving government-guaranteed loans far outweigh the costs incurred from defaulted loans.

Since SBA loans only guarantee a portion of the entire loan balance, banks will incur some losses if a small business defaults on its SBA-guaranteed loan. Therefore, banks are still faced with a difficult choice as to whether they should grant such a loan because of the high risk of default. One way to inform their decision making is through analyzing relevant historical data such as the datasets provided here.

https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342#.W4TAIi2B00o

In [1]:
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))

from IPython.display import IFrame
In [2]:
import sys
sys.path.insert(0,'../')
from utils.paths import *
In [3]:
import pandas as pd
import numpy as np
# from plotly.offline import init_notebook_mode, iplot
# import cufflinks as cf
# init_notebook_mode()
# cf.go_offline()
from __future__ import division
In [4]:
path_data = '../large_data_files/ASA_loan_data/'
def table(no):
    # there are 13 additional table
    return pd.read_csv(path_SBA + 't00%02d'%no + '-10.1080%2F10691898.2018.1434342.csv')
table(1)
In [5]:
def to_float(x):
    x = x.replace('$', '')
    x = x.replace(',', '')
    x = float(x)
    return x

Read data

In [7]:
nat = pd.read_csv(path_SBA + 'SBAnational.csv', low_memory=False)
nat = nat[nat.ApprovalFY != '1976A']
nat['ApprovalFY'] = nat.ApprovalFY.astype(int)
# nat = nat[(nat.ApprovalFY >= 1990) & pd.notnull(nat.MIS_Status)]
nat = nat[pd.notnull(nat.MIS_Status)]
nat = nat[pd.notnull(nat.Name)]

# convert to timestamp
nat['ApprovalDate'] = pd.to_datetime(nat['ApprovalDate'], errors = 'coerce')
nat['DisbursementDate'] = pd.to_datetime(nat['DisbursementDate'], errors = 'coerce')

# convert $ to float
nat['DisbursementGross'] = nat['DisbursementGross'].apply(to_float)
nat['BalanceGross'] = nat['BalanceGross'].apply(to_float)
nat['ChgOffPrinGr'] = nat['ChgOffPrinGr'].apply(to_float)
nat['GrAppv'] = nat['GrAppv'].apply(to_float)
nat['SBA_Appv'] = nat['SBA_Appv'].apply(to_float)

nat.shape
Out[7]:
(897137, 27)
In [ ]:
nat
In [8]:
nat.head()
Out[8]:
LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate ApprovalFY ... RevLineCr LowDoc ChgOffDate DisbursementDate DisbursementGross BalanceGross MIS_Status ChgOffPrinGr GrAppv SBA_Appv
0 1000014003 ABC HOBBYCRAFT EVANSVILLE IN 47711 FIFTH THIRD BANK OH 451120 1997-02-28 1997 ... N Y NaN 1999-02-28 60000.0 0.0 P I F 0.0 60000.0 48000.0
1 1000024006 LANDMARK BAR & GRILLE (THE) NEW PARIS IN 46526 1ST SOURCE BANK IN 722410 1997-02-28 1997 ... N Y NaN 1997-05-31 40000.0 0.0 P I F 0.0 40000.0 32000.0
2 1000034009 WHITLOCK DDS, TODD M. BLOOMINGTON IN 47401 GRANT COUNTY STATE BANK IN 621210 1997-02-28 1997 ... N N NaN 1997-12-31 287000.0 0.0 P I F 0.0 287000.0 215250.0
3 1000044001 BIG BUCKS PAWN & JEWELRY, LLC BROKEN ARROW OK 74012 1ST NATL BK & TR CO OF BROKEN OK 0 1997-02-28 1997 ... N Y NaN 1997-06-30 35000.0 0.0 P I F 0.0 35000.0 28000.0
4 1000054004 ANASTASIA CONFECTIONS, INC. ORLANDO FL 32801 FLORIDA BUS. DEVEL CORP FL 0 1997-02-28 1997 ... N N NaN 1997-05-14 229000.0 0.0 P I F 0.0 229000.0 229000.0

5 rows × 27 columns

In [9]:
nat = nat.sort_values('ApprovalFY')
nat.head()
Out[9]:
LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate ApprovalFY ... RevLineCr LowDoc ChgOffDate DisbursementDate DisbursementGross BalanceGross MIS_Status ChgOffPrinGr GrAppv SBA_Appv
49244 1380800010 TRYON COATS & LEATHER JOHNSTOWN NY NaN 0 KEYBANK NATIONAL ASSOCIATION NY 0 2066-05-18 1966 ... N N 29-Mar-90 2066-08-16 60000.0 0.0 CHGOFF 6084.0 60000.0 54000.0
697801 7245920005 PRINCESS MANUFACTURING CO INC AUXIER KY KY 41602 CITIZENS NATL BK OF PAINTSVILL KY 0 1968-02-09 1968 ... N N 6-Jan-89 1968-03-19 75000.0 0.0 CHGOFF 40110.0 75000.0 56250.0
732056 7616070010 CURTIS JOHNSON GENL BLDG CONTR BAKERSFIELD CALIFORNIA CA 93304 BANK OF AMERICA NATL ASSOC CA 0 1968-09-24 1969 ... N N 25-Sep-89 1968-11-22 275000.0 0.0 CHGOFF 0.0 275000.0 247500.0
726149 7556880003 RALPH O BURNETT BAKERSTOWN PA 15007 PNC BANK, NATIONAL ASSOCIATION PA 0 1969-04-28 1969 ... N N 21-Jun-95 1969-05-22 27200.0 0.0 P I F 0.0 27200.0 24480.0
720402 7499310010 T T G INC LOS ANGELES CALIF CA 90028 WELLS FARGO BANK NATL ASSOC CA 0 1968-12-03 1969 ... N N 26-Oct-88 1969-05-26 200000.0 0.0 CHGOFF 159769.0 200000.0 180000.0

5 rows × 27 columns

In [10]:
print nat.Name.nunique()
print nat.shape[0]
778662
897137
# loan_record = {} # for i in nat.Name.unique(): # loan_record[i] = [] # len(loan_record.keys()) # for i in range(len(nat)): # print i # loan_record[nat.iloc[i].Name].append(nat.iloc[i].ApprovalFY) # loan_record_df = pd.DataFrame([loan_record]).T # loan_record_df = loan_record_df.rename(columns = {0: 'loan_list'}) # loan_record_df.head() # loan_record_df['loan_start'] = loan_record_df.loan_list.apply(lambda x: min(x))
In [12]:
loan_record_df.head()
Out[12]:
loan_list loan_start
"216" PIANO BAR [2002] 2002
"360" SALON & DAY SPA [2002] 2002
"821" [1999] 1999
"A CLASS ACT" BEAUTY SALON [1989] 1989
"A LA MODE" ICE CREAM PARLOUR [1997] 1997
In [15]:
save_csv(loan_record_df, 'loan_record.csv')
<---- Saving csv file to s3 ---->
In [18]:
!aws s3 ls --human-readable s3://eh-home/ehda-calvin/SBA_study/
2018-10-26 03:13:21  171.1 MiB SBAnational.csv
2018-10-29 03:22:11    9.6 MiB loan_record.csv
2018-10-29 02:42:40    1.4 KiB t0001-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:41  772 Bytes t0002-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:42  785 Bytes t0003-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:43  913 Bytes t0004-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:43  202 Bytes t0005-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:44  203 Bytes t0006-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:45  289 Bytes t0007-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:46  408 Bytes t0008-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:47  204 Bytes t0009-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:48  294 Bytes t0010-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:49  250 Bytes t0011-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:50  316 Bytes t0012-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:50    2.3 KiB t0013-10.1080%2F10691898.2018.1434342.csv
In [20]:
loan_record_df = loan_record_df.reset_index()
loan_record_df = loan_record_df.rename(columns = {'index': 'Name'})
loan_record_df.head()
Out[20]:
Name loan_list loan_start
0 "216" PIANO BAR [2002] 2002
1 "360" SALON & DAY SPA [2002] 2002
2 "821" [1999] 1999
3 "A CLASS ACT" BEAUTY SALON [1989] 1989
4 "A LA MODE" ICE CREAM PARLOUR [1997] 1997
In [21]:
def company_suffix(x):
    # classify the business based on the abbreviation after the business name
    l = x.split()[-1].replace('.', '').upper()
    if l in ['C', 'L', 'I', 'A', 'S', 'D', 'M']:
        if 'L L C' in x.upper():
            l = 'LLC'
        elif len(x) == 1:
            l = l
        else:
            l = x[:-1].split()[-1].replace('.', '').upper()
            
    if l in ['INC', 'INCORPORATED']: # incorporated
        abb = 'INC'
    elif l in ['CO', 'COMPANY']:
        if 'LTD' in x.upper():
            abb = 'LTD'
        else:
            abb = 'CO'
    elif l in ['IN']:
        abb = 'IN'
    elif l in ['LLC']: # limited liability companies
        abb = 'LLC'
    elif l in ['LLP']: # limited liability companies
        abb = 'LLP'
    elif l in ['LL']: # limited liability companies
        abb = 'LL'  
    elif l in ['LTD']: # "LTD" or "Ltd." stands for "limited"
        abb = 'LTD'
    elif l in ['PC']: # professional corporation such as medicine, law and accounting
        abb = 'PC'
    elif l in ['PLLC']: # professional limited liability company
        abb = 'PLLC'
    elif l in ['PA']: # professional association
        abb = 'PA'
    elif l in ['CORP', 'CORPORATION']:
        abb = 'CORP'
    elif l in ['ASSOC', 'ASSOCIATES']:
        abb = 'ASSOC'
    elif l in ['DDS']:
        abb = 'DDS'
    elif l in ['MD', 'CLINIC']:
        abb = 'CLINIC'
    elif l in ['HOSPITAL']:
        abb = 'HOSPITAL'      
    elif l in ['SALON']:
        abb = 'SALON' 
    elif l in ['CONSTRUCTION']:
        abb = 'CONSTRUCTION'
    elif l in ['CLEANERS']:
        abb = 'CLEANERS' 
    elif l in ['LAUNDRY', ]:
        abb = 'LAUNDRY'
    elif l in ['DMD']:
        abb = 'DMD' 
    elif l in ['ENTERPRISES']:
        abb = 'ENTERPRISES'
    elif l in ['CLUB', 'FOUNDATION', 'FUND', 'INSTITUTE', 'SOCIETY', 'UNION', 'SYNDICATE']:
        abb = 'Non-Profit'
    elif l in ['RESTAURANT', 'CAFE', 'GRILL']:
        abb = 'RESTAURANT'
    elif l in ['MARKET', 'MART', 'SUPERMARKET', 'GROCERY', 'DELI', 'PIZZA', 'BAR']:
        abb = 'MARKET'
    elif l in ['SERVICES', 'SERVICE', 'SERV', 'SERVIC']:
        abb = 'SERVICE'   
    elif l in ['FARM']:
        abb = 'FARM'
    elif l in ['HOTEL', 'MOTEL', 'INN']:
        abb = 'HOTEL'   
    else:
        abb = 'NO SUFFIX'
    return abb    
    
In [22]:
loan_record_df['Suffix'] = loan_record_df.Name.apply(company_suffix)
In [23]:
loan_record_df.sample(10)
Out[23]:
Name loan_list loan_start Suffix
65465 Awesome Brand International Co [2005] 2005 CO
514395 ORTEGA HIGHWAY CHEVRON [2003] 2003 NO SUFFIX
736232 Uruk World Trading & Contracti [2008] 2008 NO SUFFIX
333350 IMAGE LEATHER [1997] 1997 NO SUFFIX
772756 YAKITORI RESTAURANT [1994] 1994 RESTAURANT
587722 ROBEKS WALNUT [2004] 2004 NO SUFFIX
456201 MH WELDING [2002] 2002 NO SUFFIX
766571 WORKSAVER MAT'L HDL'G EQ.CO. [2000] 2000 NO SUFFIX
79311 BENCHMARK TEXTILES [1993] 1993 NO SUFFIX
501330 NOVELTY CONE COMPANY, INC. [1999] 1999 INC
# loan_record_df[loan_record_df.Suffix != 'NO SUFFIX'] loan_record_df.Suffix.value_counts()loan_record_df[loan_record_df.Suffix == 'NO SUFFIX'].Name.apply(lambda x: x.split()[-1].replace('.', '').upper()).value_counts().head(50)
In [164]:
loan_record_df.head()
Out[164]:
Name loan_list loan_start Suffix
0 "216" PIANO BAR [2002] 2002 MARKET
1 "360" SALON & DAY SPA [2002] 2002 NO SUFFIX
2 "821" [1999] 1999 NO SUFFIX
3 "A CLASS ACT" BEAUTY SALON [1989] 1989 SALON
4 "A LA MODE" ICE CREAM PARLOUR [1997] 1997 NO SUFFIX
In [25]:
save_csv(loan_record_df, 'extra_company_info.csv')
<---- Saving csv file to s3 ---->
In [106]:
# loan_record_df[loan_record_df.Name.apply(lambda x: x.split()[-1].replace('.', '').upper()).isin(['STORE'])]
In [57]:
def previous_loan(loan_list, year):
    return len(filter(lambda x: x > year, loan_list))

def loan_age(loan_start, year):
    if year <= loan_start:
        age = 0
    else:
        age = year - loan_start
    return age
In [58]:
%%time
print previous_loan(x, 2003)
2
CPU times: user 307 µs, sys: 226 µs, total: 533 µs
Wall time: 304 µs
In [59]:
loan_record_df.head(100).loan_list.apply(previous_loan, args = (1990,)).value_counts()
Out[59]:
1    81
0    15
2     4
Name: loan_list, dtype: int64
In [60]:
loan_record_df.head(100).loan_start.apply(loan_age, args = (1990,)).value_counts()
Out[60]:
0    89
1     6
6     2
5     1
3     1
2     1
Name: loan_start, dtype: int64
In [7]:
def default(x):
    if x == 'CHGOFF':
        return 1
    else:
        return 0

nat['default'] = nat.MIS_Status.apply(default)
In [28]:
nat.ApprovalFY.value_counts().sort_index()
Out[28]:
1966        1
1968        1
1969        3
1970        8
1971       18
1972       25
1973       49
1974       42
1975       29
1976       65
1977      137
1978      239
1979      349
1980      453
1981      602
1982      719
1983     1682
1984     2019
1985     1941
1986     2118
1987     2218
1988     1898
1989    13245
1990    14859
1991    15660
1992    20875
1993    23299
1994    31584
1995    45688
1996    40021
1997    37718
1998    36003
1999    37344
2000    37352
2001    37317
2002    44307
2003    58000
2004    68195
2005    76957
2006    75753
2007    71649
2008    39458
2009    19103
2010    16828
2011    12593
2012     5991
2013     2454
2014      268
Name: ApprovalFY, dtype: int64
In [27]:
nat[nat.Name == 'ABC HOBBYCRAFT'].T
Out[27]:
0
LoanNr_ChkDgt 1000014003
Name ABC HOBBYCRAFT
City EVANSVILLE
State IN
Zip 47711
Bank FIFTH THIRD BANK
BankState OH
NAICS 451120
ApprovalDate 1997-02-28 00:00:00
ApprovalFY 1997
Term 84
NoEmp 4
NewExist 2
CreateJob 0
RetainedJob 0
FranchiseCode 1
UrbanRural 0
RevLineCr N
LowDoc Y
ChgOffDate NaN
DisbursementDate 1999-02-28 00:00:00
DisbursementGross 60000
BalanceGross 0
MIS_Status P I F
ChgOffPrinGr 0
GrAppv 60000
SBA_Appv 48000

Exploratory data analysis

In [9]:
nat.GrAppv.iplot(kind = 'hist', bins = 100, title = 'Histogram of grant approved', xTitle = 'Grant')
In [10]:
# No. of loan each year
nat.groupby('ApprovalFY').count().max(1).iplot(kind = 'bar', title = 'Loan cases')

Default rate

In [11]:
(nat[nat.default == 1].groupby('ApprovalFY').count().max(1) / 
 nat.groupby('ApprovalFY').count().max(1)).iplot(kind = 'bar', title = 'Default rate')
In [12]:
nat[nat.default == 1].groupby('ApprovalFY').SBA_Appv.sum().iplot(kind = 'bar')
In [13]:
nat[nat.default == 1].ChgOffPrinGr.iplot(kind = 'hist', bins = 50)
In [14]:
nat.head().T
Out[14]:
0 1 2 3 4
LoanNr_ChkDgt 1000014003 1000024006 1000034009 1000044001 1000054004
Name ABC HOBBYCRAFT LANDMARK BAR & GRILLE (THE) WHITLOCK DDS, TODD M. BIG BUCKS PAWN & JEWELRY, LLC ANASTASIA CONFECTIONS, INC.
City EVANSVILLE NEW PARIS BLOOMINGTON BROKEN ARROW ORLANDO
State IN IN IN OK FL
Zip 47711 46526 47401 74012 32801
Bank FIFTH THIRD BANK 1ST SOURCE BANK GRANT COUNTY STATE BANK 1ST NATL BK & TR CO OF BROKEN FLORIDA BUS. DEVEL CORP
BankState OH IN IN OK FL
NAICS 451120 722410 621210 0 0
ApprovalDate 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00
ApprovalFY 1997 1997 1997 1997 1997
Term 84 60 180 60 240
NoEmp 4 2 7 2 14
NewExist 2 2 1 1 1
CreateJob 0 0 0 0 7
RetainedJob 0 0 0 0 7
FranchiseCode 1 1 1 1 1
UrbanRural 0 0 0 0 0
RevLineCr N N N N N
LowDoc Y Y N Y N
ChgOffDate NaN NaN NaN NaN NaN
DisbursementDate 1999-02-28 00:00:00 1997-05-31 00:00:00 1997-12-31 00:00:00 1997-06-30 00:00:00 1997-05-14 00:00:00
DisbursementGross 60000 40000 287000 35000 229000
BalanceGross 0 0 0 0 0
MIS_Status P I F P I F P I F P I F P I F
ChgOffPrinGr 0 0 0 0 0
GrAppv 60000 40000 287000 35000 229000
SBA_Appv 48000 32000 215250 28000 229000
default 0 0 0 0 0
In [15]:
nat.State.value_counts().head()
Out[15]:
CA    126846
TX     68102
NY     55897
FL     40719
PA     34290
Name: State, dtype: int64
In [16]:
# default rate
nat['default'].sum() / len(nat)
Out[16]:
0.17228697508765795
In [17]:
nat.default.value_counts()
Out[17]:
0    719521
1    149767
Name: default, dtype: int64

LowDoc

LowDoc (Y = Yes, N = No): In order to process more loans efficiently, a “LowDoc Loan” program was implemented where loans under $150,000 can be processed using a one-page application. “Yes” indicates loans with a one-page application, and “No” indicates loans with more information attached to the application. In this dataset, 87.31% are coded as N (No) and 12.31% as Y (Yes) for a total of 99.62%. It is worth noting that 0.38% have other values (0, 1, A, C, R, S); these are data entry errors.

In [18]:
nat.LowDoc.sample(10)
Out[18]:
42407     N
796939    N
817802    N
453118    N
849883    N
401453    N
814278    N
870645    N
36434     N
765376    N
Name: LowDoc, dtype: object
In [19]:
def clean_LowDoc(x):
    if x == 'Y':
        return 1
    elif x == 'N':
        return 0
    else:
        return np.NaN
    
nat['LowDoc'] = nat.LowDoc.apply(clean_LowDoc)

NAICS

NAICS (North American Industry Classification System): This is a 2- through 6-digit hierarchical classification system used by Federal statistical agencies in classifying business establishments for the collection, analysis, and presentation of statistical data describing the U.S. economy. The first two digits of the NAICS classification represent the economic sector.

In [20]:
table(3)
Out[20]:
Sector Description
0 11 Agriculture, forestry, fishing and hunting
1 21 Mining, quarrying, and oil and gas extraction
2 22 Utilities
3 23 Construction
4 31–33 Manufacturing
5 42 Wholesale trade
6 44–45 Retail trade
7 48–49 Transportation and warehousing
8 51 Information
9 52 Finance and insurance
10 53 Real estate and rental and leasing
11 54 Professional, scientific, and technical services
12 55 Management of companies and enterprises
13 56 Administrative and support and waste managemen...
14 61 Educational services
15 62 Health care and social assistance
16 71 Arts, entertainment, and recreation
17 72 Accommodation and food services
18 81 Other services (except public administration)
19 92 Public administration
In [21]:
nat.NAICS.value_counts().head()
Out[21]:
0         181207
722110     27941
722211     19435
811111     14235
621210     13756
Name: NAICS, dtype: int64
In [22]:
# nat.NAICS.apply(lambda x: len(str(x))).value_counts()
def fix_naics(x):
    if x == 0:
        naics = '999999'
    else:
        naics = str(x)
    return naics

def naics_sector(x):
    if x == '999999':
        naics = '999999'
    else:
        naics = int(str(x)[:2])
        if (naics >= 31) and (naics <= 33):
            naics = '31'
        elif (naics >= 44) and (naics <= 45):
            naics = '44'
        elif (naics >= 48) and (naics <= 49):
            naics = '48'
        else:
            naics = str(naics)
    return naics
    
In [23]:
nat['NAICS_group'] = nat.NAICS.apply(naics_sector)
In [24]:
nat['NAICS_group'].value_counts().sort_index()
Out[24]:
0     181207
11      8773
21      1817
22       656
23     65785
31     66636
42     47881
44    125289
48     22373
51     11319
52      9467
53     13472
54     67465
55       254
56     32431
61      6390
62     54632
71     14488
72     67403
81     71325
92       225
Name: NAICS_group, dtype: int64
In [25]:
def naics_defaut_rate(x):
    default_dict = {'11': 9, '21': 8, '22': 14, '23': 23, '31': 19, '32': 16, '33': 14, '42': 19, 
               '44': 22, '45': 23, '48': 27, '49': 23, '51': 25, '52': 28, '53': 29, '54': 19, 
               '55': 10, '56': 24, '61': 24, '62': 10, '71': 21, '72': 22, '81': 20, '92': 15}
    naics = str(x)[:2]
    if naics not in default_dict.keys():
        default_rate = np.NaN
    else:
        default_rate = default_dict[naics]
    return default_rate
In [26]:
nat['NAICS_default_rate'] = nat.NAICS.apply(naics_defaut_rate)

Loans Backed by Real Estate

Whether a loan is backed by real estate (possession of land) is another risk indicator that is discussed. The rationale for this indicator is that the value of the land is often large enough to cover the amount of any principal outstanding, thereby reducing the probability of default.

Since the term of the loan is a function of the expected lifetime of the assets, loans backed by real estate will have terms 20 years or greater (≥240 months) and are the only loans granted for such a long term, whereas loans not backed by real estate will have terms less than 20 years (<240 months). Therefore, the authors created a dummy variable, “RealEstate,” where “RealEstate” = 1 if “Term” ≥240 months and “RealEstate” = 0 if “Term” <240 months.

In [27]:
def RealEstate(x):
    if x >= 240:
        return 1
    else:
        return 0
In [28]:
nat['RealEstate'] = nat.Term.apply(RealEstate)
In [29]:
nat.head()
Out[29]:
LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate ApprovalFY ... DisbursementGross BalanceGross MIS_Status ChgOffPrinGr GrAppv SBA_Appv default NAICS_group NAICS_default_rate RealEstate
0 1000014003 ABC HOBBYCRAFT EVANSVILLE IN 47711 FIFTH THIRD BANK OH 451120 1997-02-28 1997 ... 60000.0 0.0 P I F 0.0 60000.0 48000.0 0 44 23.0 0
1 1000024006 LANDMARK BAR & GRILLE (THE) NEW PARIS IN 46526 1ST SOURCE BANK IN 722410 1997-02-28 1997 ... 40000.0 0.0 P I F 0.0 40000.0 32000.0 0 72 22.0 0
2 1000034009 WHITLOCK DDS, TODD M. BLOOMINGTON IN 47401 GRANT COUNTY STATE BANK IN 621210 1997-02-28 1997 ... 287000.0 0.0 P I F 0.0 287000.0 215250.0 0 62 10.0 0
3 1000044001 BIG BUCKS PAWN & JEWELRY, LLC BROKEN ARROW OK 74012 1ST NATL BK & TR CO OF BROKEN OK 0 1997-02-28 1997 ... 35000.0 0.0 P I F 0.0 35000.0 28000.0 0 0 NaN 0
4 1000054004 ANASTASIA CONFECTIONS, INC. ORLANDO FL 32801 FLORIDA BUS. DEVEL CORP FL 0 1997-02-28 1997 ... 229000.0 0.0 P I F 0.0 229000.0 229000.0 0 0 NaN 1

5 rows × 31 columns

SBA's Guaranteed Portion of Approved Loan

The portion which is the percentage of the loan that is guaranteed by SBA (represented as “Portion” in the dataset) is a final risk indicator that is discussed in our courses. This is one of the variables that the authors generated calculating the ratio of the amount of the loan SBA guarantees and the gross amount approved by the bank (SBA_Appv/GrAppv)

In [30]:
nat['SBA_ratio'] = nat.SBA_Appv / nat.GrAppv

nat.SBA_ratio.iplot(kind = 'hist', bins = 10, title = 'Histogram of SBA approved ratio', xTitle = 'SBA_ratio')
In [31]:
nat[nat.default == 1].SBA_ratio.iplot(kind = 'hist', bins = 10)
In [32]:
nat.groupby('default').SBA_ratio.mean()
Out[32]:
default
0    0.718599
1    0.632673
Name: SBA_ratio, dtype: float64
In [33]:
nat.head().T
Out[33]:
0 1 2 3 4
LoanNr_ChkDgt 1000014003 1000024006 1000034009 1000044001 1000054004
Name ABC HOBBYCRAFT LANDMARK BAR & GRILLE (THE) WHITLOCK DDS, TODD M. BIG BUCKS PAWN & JEWELRY, LLC ANASTASIA CONFECTIONS, INC.
City EVANSVILLE NEW PARIS BLOOMINGTON BROKEN ARROW ORLANDO
State IN IN IN OK FL
Zip 47711 46526 47401 74012 32801
Bank FIFTH THIRD BANK 1ST SOURCE BANK GRANT COUNTY STATE BANK 1ST NATL BK & TR CO OF BROKEN FLORIDA BUS. DEVEL CORP
BankState OH IN IN OK FL
NAICS 451120 722410 621210 0 0
ApprovalDate 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00
ApprovalFY 1997 1997 1997 1997 1997
Term 84 60 180 60 240
NoEmp 4 2 7 2 14
NewExist 2 2 1 1 1
CreateJob 0 0 0 0 7
RetainedJob 0 0 0 0 7
FranchiseCode 1 1 1 1 1
UrbanRural 0 0 0 0 0
RevLineCr N N N N N
LowDoc 1 1 0 1 0
ChgOffDate NaN NaN NaN NaN NaN
DisbursementDate 1999-02-28 00:00:00 1997-05-31 00:00:00 1997-12-31 00:00:00 1997-06-30 00:00:00 1997-05-14 00:00:00
DisbursementGross 60000 40000 287000 35000 229000
BalanceGross 0 0 0 0 0
MIS_Status P I F P I F P I F P I F P I F
ChgOffPrinGr 0 0 0 0 0
GrAppv 60000 40000 287000 35000 229000
SBA_Appv 48000 32000 215250 28000 229000
default 0 0 0 0 0
NAICS_group 44 72 62 0 0
NAICS_default_rate 23 22 10 NaN NaN
RealEstate 0 0 0 0 1
SBA_ratio 0.8 0.8 0.75 0.8 1

RevLineCr

Revolving Line of Credit : Y = Yes

In [34]:
nat.RevLineCr.value_counts()
Out[34]:
N    391402
0    257431
Y    200631
T     15239
1        23
R        14
`        11
2         6
C         2
5         1
7         1
4         1
-         1
A         1
Q         1
3         1
,         1
.         1
Name: RevLineCr, dtype: int64
In [35]:
def clean_RevLineCr(x):
    if x == 'Y':
        return 1
    else:
        return 0

nat['RevLineCr'] = nat.RevLineCr.apply(clean_RevLineCr)

Zip code

A ZIP Code is a postal code used by the United States Postal Service (USPS) in a system it introduced in 1963.

The first digit of the ZIP Code is allocated as follows:

0 = Connecticut (CT), Massachusetts (MA), Maine (ME), New Hampshire (NH), New Jersey (NJ), New York (NY, Fishers Island only), Puerto Rico (PR), Rhode Island (RI), Vermont (VT), Virgin Islands (VI), Army Post Office Europe (AE), Fleet Post Office Europe (AE) 1 = Delaware (DE), New York (NY), Pennsylvania (PA) 2 = District of Columbia (DC), Maryland (MD), North Carolina (NC), South Carolina (SC), Virginia (VA), West Virginia (WV) 3 = Alabama (AL), Florida (FL), Georgia (GA), Mississippi (MS), Tennessee (TN), Army Post Office Americas (AA), Fleet Post Office Americas (AA) 4 = Indiana (IN), Kentucky (KY), Michigan (MI), Ohio (OH) 5 = Iowa (IA), Minnesota (MN), Montana (MT), North Dakota (ND), South Dakota (SD), Wisconsin (WI) 6 = Illinois (IL), Kansas (KS), Missouri (MO), Nebraska (NE) 7 = Arkansas (AR), Louisiana (LA), Oklahoma (OK), Texas (TX) 8 = Arizona (AZ), Colorado (CO), Idaho (ID), New Mexico (NM), Nevada (NV), Utah (UT), Wyoming (WY) 9 = Alaska (AK), American Samoa (AS), California (CA), Guam (GU), Hawaii (HI), Marshall Islands (MH), Federated States of Micronesia (FM), Northern Mariana Islands (MP), Oregon (OR), Palau (PW), Washington (WA), Army Post Office Pacific (AP), Fleet Post Office Pacific (AP)

The next two digits represent the sectional center facility (SCF) (e.g. 477xx = Vanderburgh County, Indiana), and the fourth and fifth digits represent the area of the city (if in a metropolitan area), or a village/town (outside metro areas): 47722 (4=Indiana, 77=Vanderburgh County, 22=University of Evansville area). When a sectional center facility's area crosses state lines, that facility is assigned separate three-digit prefixes for the states that it serves.

(https://en.wikipedia.org/wiki/ZIP_Code)

Connecticut (CT) Massachusetts (MA) Maine (ME) New Hampshire (NH) New Jersey (NJ) New York (NY) Puerto Rico (PR) Rhode Island (RI) Vermont (VT) Virgin Islands (VI) Army Post Office Europe (AE) Fleet Post Office Europe (AE)

'CT','MA','ME','NH','NJ','NY','PR','RI','VT','VI','AE','AE'

In [36]:
nat['Zip_length'] = nat.Zip.apply(lambda x: len(str(x)))
nat['Zip_length'].value_counts()
Out[36]:
5    777484
4     91392
1       377
3        33
2         2
Name: Zip_length, dtype: int64

Seem like some of the zip code has missing values!

In [37]:
def zip_5d(Zip, State):
    # fix zip code with error
    zero_head = ['CT','MA','ME','NH','NJ','NY','PR','RI','VT','VI','AE','AE']
    if (Zip == 9999) or (Zip == 99999):
        zip_code = '99999'
    elif len(str(Zip)) == 4:
        if State in zero_head:
            zip_code = '0' + str(Zip)
        else:
            zip_code = '99999'
    elif len(str(Zip)) == 3:
        if State in zero_head:
            zip_code = '00' + str(Zip)
        else:
            zip_code = '99999'
    elif len(str(Zip)) < 3:
        zip_code = '99999'
    else:
        zip_code = str(Zip)
    return zip_code

nat['Zip5d'] = nat.apply(lambda x: zip_5d(x['Zip'], x['State']), axis = 1)
In [38]:
nat[nat['Zip5d'] == '99999'].shape
Out[38]:
(828, 34)
In [39]:
nat['Zip3d'] = nat.Zip5d.str[:3]
In [40]:
nat['Zip5d'].value_counts().head()
Out[40]:
10001    924
90015    918
99999    828
93401    744
90010    732
Name: Zip5d, dtype: int64
In [41]:
nat['Zip3d'].value_counts().head()
Out[41]:
900    12286
770    10073
840     7916
750     7824
917     6962
Name: Zip3d, dtype: int64
In [42]:
nat = nat[['LoanNr_ChkDgt', 'Name', 'City', 'State', 'Zip5d', 'Zip3d', 'Bank', 'BankState', 'NAICS', 'NAICS_group', 
           'NAICS_default_rate', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 
           'FranchiseCode', 'UrbanRural', 'RevLineCr', 'LowDoc', 'ChgOffDate', 'DisbursementDate', 'DisbursementGross', 
           'BalanceGross', 'MIS_Status', 'ChgOffPrinGr', 'GrAppv', 'SBA_Appv', 'default', 'RealEstate', 'SBA_ratio']]
In [46]:
pd.isnull(nat).sum()
Out[46]:
LoanNr_ChkDgt              0
Name                      12
City                       0
State                      8
Zip5d                      0
Zip3d                      0
Bank                    1037
BankState               1043
NAICS                      0
NAICS_group                0
NAICS_default_rate    181207
ApprovalDate               0
ApprovalFY                 0
Term                       0
NoEmp                      0
NewExist                 134
CreateJob                  0
RetainedJob                0
FranchiseCode              0
UrbanRural                 0
RevLineCr                  0
LowDoc                  5999
ChgOffDate            714886
DisbursementDate        2011
DisbursementGross          0
BalanceGross               0
MIS_Status                 0
ChgOffPrinGr               0
GrAppv                     0
SBA_Appv                   0
default                    0
RealEstate                 0
SBA_ratio                  0
dtype: int64

Saving datasets

In [47]:
nat.to_csv(path_data + 'SBAnational_new.csv', sep = ';', index = False)
In [48]:
!ls -all -h ../large_data_files/ASA_loan_data/
total 711816
drwxr-xr-x  6 ehda-calvin  staff   192B Oct 22 16:46 .
drwxr-xr-x  4 ehda-calvin  staff   128B Aug 28 13:47 ..
-rw-r--r--@ 1 ehda-calvin  staff   6.0K Aug 28 13:47 .DS_Store
-rwxr-xr-x@ 1 ehda-calvin  staff   380K Aug 28 12:18 SBAcase_11_13_17.csv
-rwxr-xr-x@ 1 ehda-calvin  staff   171M Aug 28 13:38 SBAnational.csv
-rw-r--r--  1 ehda-calvin  staff   165M Oct 24 14:27 SBAnational_new.csv
In [49]:
nat03_04 = nat[(nat.ApprovalFY >= 2003) & (nat.ApprovalFY < 2005)]
nat05_06 = nat[(nat.ApprovalFY >= 2005) & (nat.ApprovalFY < 2007)]
nat07_08 = nat[(nat.ApprovalFY >= 2007) & (nat.ApprovalFY < 2009)]
print nat03_04.shape, nat05_06.shape, nat07_08.shape
(126195, 33) (152714, 33) (111107, 33)
In [50]:
nat03_04.to_csv('additional_data/nat03_04.csv', index=False)
nat05_06.to_csv('additional_data/nat05_06.csv', index=False)
nat07_08.to_csv('additional_data/nat07_08.csv', index=False)